A Domain-Independent Data Cleaning Algorithm for Detecting Similar-Duplicates
Abstract
Data mining algorithms generally assume that data are clean and consistent. In practice this is not always the case, so detecting and eliminating duplicate records is an important part of data cleaning. Similar-duplicate records cause over-representation of data: if the database contains different representations of the same data, the results obtained from a data mining algorithm will be erroneous. Detecting similar-duplicate records is a difficult task, especially in a domain-independent setting. In this paper, we propose a novel domain-independent technique for better reconciling similar-duplicate records. We also introduce new ideas for making similar-duplicate detection algorithms faster and more efficient, and we propose a significant modification of the transitivity rule. Finally, we propose an algorithm that incorporates all of these techniques for similar-duplicate detection in a domain-independent environment. We compared the proposed method with other methods, and the experimental results confirm its superiority.
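To make the terms in the abstract concrete, the following is a minimal sketch of generic similar-duplicate detection with the plain (unmodified) transitivity rule: if record A matches B and B matches C, all three are placed in one group. It is not the algorithm proposed in the paper, whose details are not given here. The function names, the use of `difflib.SequenceMatcher` as the field similarity, and the 0.8 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Average per-field string similarity in [0, 1].

    Assumption: records are equal-length tuples of strings, and a generic
    character-based ratio stands in for a domain-independent measure.
    """
    ratios = [
        SequenceMatcher(None, x.lower(), y.lower()).ratio()
        for x, y in zip(a, b)
    ]
    return sum(ratios) / len(ratios)


def find_duplicate_groups(records, threshold=0.8):
    """Group records whose pairwise similarity reaches the threshold.

    Matches are merged with union-find, which applies the plain
    transitivity rule. The threshold value is hypothetical.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group representative,
        # shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Naive all-pairs comparison: quadratic in the number of records.
    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    # Only groups with more than one record contain duplicates.
    return [g for g in groups.values() if len(g) > 1]


records = [
    ("John Smith", "12 Main Street"),
    ("Jon Smith", "12 Main St."),
    ("Alice Brown", "9 Oak Avenue"),
]
duplicate_groups = find_duplicate_groups(records)
```

In this sketch the first two records end up in one group and the third stays alone. The all-pairs loop and the unconditional merging are the two places where speed and false matches become a concern, which is where the abstract says the paper's efficiency ideas and its modified transitivity rule apply.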
Similar resources
Using well defined tokens in similarity function for record matching in data cleaning techniques
The integration of information is an important area of research in databases. The duplicate elimination problem of detecting database records that are approximate duplicates, but not exact duplicates, which describe the same real world entity, is an important data cleaning problem. To ensure high data quality, a data warehouse must cleanse data by detecting and eliminating redundant data. Dur...
Eliminating Fuzzy Duplicates in Data Warehouses
The duplicate elimination problem of detecting multiple tuples, which describe the same real world entity, is an important data cleaning problem. Previous domain independent solutions to this problem relied on standard textual similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such approaches ...
Data Position and Profiling in Domain-Independent Warehouse Cleaning
A major problem that arises from integrating different databases is the existence of duplicates. Data cleaning is the process for identifying two or more records within the database, which represent the same real world object (duplicates), so that a unique representation for each object is adopted. Existing data cleaning techniques rely heavily on full or partial domain knowledge. This paper pr...
A New Method for Duplicate Detection Using Hierarchical Clustering of Records
Accuracy and validity of data are prerequisites for the correct operation of any software system. Errors can always occur in data because of human and system faults. One such error is the existence of duplicate records in data sources. Duplicate records refer to the same real world entity. Only one of them should exist in a data source, but for reasons such as aggregation of ...
Independent De-Duplication in Data Cleaning
Many organizations collect large amounts of data to support their business and decision-making processes. The data originate from a variety of sources that may have inherent data-quality problems. These problems become more pronounced when heterogeneous data sources are integrated (for example, in data warehouses). A major problem that arises from integrating different databases is the existenc...
Journal: JCP
Volume: 5
Publication date: 2010